This project addresses the question: “Is there a significant difference in income between men and women, and does the difference vary depending on other factors such as education, marital status, criminal history, drug use, childhood household factors, profession, etc.?”
We are using the NLSY97 (National Longitudinal Survey of Youth, 1997 cohort) data set. The National Longitudinal Surveys (NLS) are a set of surveys designed to gather information at multiple points in time on the labor market activities and other significant life events of several groups of men and women. For more than four decades, NLS data have served as an important tool for economists, sociologists, and other researchers. The NLSY97 data set contains survey responses from thousands of individuals who have been surveyed every one or two years starting in 1997.
Loading the data from “nlsy97_income.csv” gives 8984 observations of 79 variables.
From the 79 variables in the data set, I selected 13 for further analysis that I hypothesised would have the greatest impact on the income gap.
The data set does not come with very descriptive variable names, so I renamed the variables to make the columns easier to read.
Below is the list of selected variable names and their definitions:
| Variable | Description |
|---|---|
| totalincarceration | Total number of incarcerations reported by the respondent |
| gender | Gender of the Respondent (Male/Female) |
| physicalEmotionalCondition | Respondent’s Physical and Emotional Condition that limits School/Work |
| race | Race of the Respondent |
| biologicalChild | Number of biological children born and residing in the household |
| collegeType | Respondent’s School Information (public, private not-for-profit, private for-profit or Not Attended College) |
| familyIncome | Gross family income in the previous year |
| drugUse | Respondent used hard drugs since the date of last interview (DLI) |
| industry | Type of Industry or Business |
| income | Income received by Respondent Last Year |
| maritalStatus | Whether the respondent’s spouse received income, used to infer whether the respondent is married |
All the factors are initially represented as integers. I used the transform() and mapvalues() functions to convert these variables to factors and give the factors more meaningful levels.
'data.frame': 8984 obs. of 13 variables:
$ totalincarceration : int 0 0 0 0 0 0 0 0 0 0 ...
$ gender : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 2 2 ...
$ physicalEmotionalCondition: Factor w/ 5 levels "DontKnow","No",..: 2 4 2 5 2 2 2 2 2 2 ...
$ race : Factor w/ 4 levels "Black","Hispanic",..: 4 2 2 2 2 2 2 4 4 4 ...
$ biologicalChild : int -4 -4 -5 2 1 1 -4 -5 -4 -5 ...
$ collegeType : Factor w/ 6 levels "Invalid","NotInterviewed",..: 6 6 6 6 6 6 6 6 3 2 ...
$ familyIncome : int 50000 81000 150250 -3 130000 55000 14766 66750 110000 -5 ...
$ highestDegree : Factor w/ 10 levels "Associate/Junior college (AA)",..: 2 4 1 4 4 4 3 6 6 8 ...
$ drugUse : Factor w/ 6 levels "DontKnow","No",..: 2 2 2 2 2 2 2 2 2 3 ...
$ industry : Factor w/ 20 levels "ACS SPECIAL CODES",..: 20 15 14 5 15 5 19 20 5 12 ...
$ income : int 70000 83000 -2 29000 76000 15000 -5 -5 54000 -5 ...
$ estimatedIncome : num -4 -4 37500 -4 -4 -4 -5 -5 -4 -5 ...
$ maritalStatus : Factor w/ 6 levels "DontKnow","No",..: 5 5 6 6 6 6 3 3 6 3 ...
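The integer-to-factor conversion described above can be sketched as follows. This is a minimal illustration on toy data, not the original cleaning script: it uses base R `factor()` inside `transform()` rather than `mapvalues()`, and the code-to-label mappings shown are assumptions made for the example.

```r
# Toy data standing in for two raw survey columns.
# The integer codes and their labels are assumed for illustration only.
raw <- data.frame(
  gender  = c(2L, 1L, 2L, 2L, 1L),
  drugUse = c(0L, 1L, 0L, -4L, 0L)
)

# transform() evaluates each expression inside the data frame, so every
# integer column is replaced by a factor with readable levels.
recoded <- transform(
  raw,
  gender  = factor(gender, levels = c(1L, 2L), labels = c("male", "female")),
  drugUse = factor(drugUse, levels = c(-4L, 0L, 1L),
                   labels = c("ValidSkip", "No", "Yes"))
)

str(recoded)
```

Listing the levels explicitly keeps the mapping visible in one place, which makes it easier to audit than renaming levels after the fact.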
We start by comparing income between males and females using a simple box plot:
OBSERVATION : The above plot suggests that men earn more than women.
The graph also shows that the data has many outliers, which might influence our interpretation of the data set. The income data is positively skewed, which might also affect our inferences.
To get a better understanding of the data set, we plot the data distribution using a Q-Q plot.
From this we can see that the data is skewed at the edges. Many data points have 0 income, and the top 2 percent of incomes are top-coded to the average value of the top 2 percent of earners, i.e. 180331. We need to clean the data set before moving forward with the analysis, as dirty data will give us incorrect results.
The data taken from NLSY97 is messy and has several issues that need to be addressed before performing any further analysis.
PROBLEM 1
Some values are not real measurements but special negative codes, used across all the attributes, that mark why a value is missing:
| Code | Description |
|---|---|
| -1 | Refused to answer |
| -2 | Don’t know |
| -3 | Invalid Skip (Data not retrieved/lost) |
| -4 | Valid Skip (Question not relevant to the respondent) |
| -5 | Respondent not interviewed that year |
PROBLEM 2
The top 2 percent of income values are coded to the average value of the top 2 percent of cases, i.e. 180331, as the respondents were not comfortable with declaring their true income in the survey. This results in a skewed data set, as seen in the data distribution above.
To handle values that are missing for these different reasons, I changed some codes to Not Available (NA) and imputed others, depending on the attribute:
| Variable | Refusal | Don’t Know | Invalid Skip | Valid Skip | Non-Interview |
|---|---|---|---|---|---|
| income | Change to Middle value of Estimated Income Range if available | Change to Middle value of Estimated Income Range if available | - | 0 | NA |
| totalincarceration | - | - | - | - | - |
| gender | - | - | - | - | - |
| physicalEmotionalCondition | NA | NA | - | NA | - |
| race | - | - | - | - | - |
| biologicalChild | - | - | NA | NA | NA |
| collegeType | - | - | NA | Did not Attend | NA |
| familyIncome | - | - | NA | - | NA |
| highestDegree | - | - | NA | NA | NA |
| drugUse | Used Drugs | Used Drugs | NA | - | NA |
| industry | - | - | NA | Not Working | NA |
| maritalStatus | NA | NA | - | Not Married | NA |
Some respondents gave an estimated income range rather than their true income. For them, the middle value of the estimated range can be used as their income, which improves the data set by reducing the number of missing values. These values might not be exact, but they provide a good estimate of income.
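A minimal sketch of this imputation rule, combined with the income row of the table above, might look like the following. The column names `estLow` and `estHigh` are hypothetical stand-ins for the bounds of the estimated range; the real data set may store the range differently.

```r
# Toy data. `income` holds the reported value or a negative missing code.
# `estLow`/`estHigh` (hypothetical names) hold the estimated range bounds,
# or NA when the respondent gave no range.
toy <- data.frame(
  income  = c(70000, -1, -2, -2, -4, -5),
  estLow  = c(NA, 25000, 50000, NA, NA, NA),
  estHigh = c(NA, 50000, 100000, NA, NA, NA)
)

midpoint <- (toy$estLow + toy$estHigh) / 2
unknown  <- toy$income %in% c(-1, -2)        # refusal or don't know
fill     <- unknown & !is.na(midpoint)       # a range is available

cleaned <- toy$income
cleaned[fill] <- midpoint[fill]              # use the middle of the range
cleaned[toy$income == -4] <- 0               # valid skip -> 0
cleaned[toy$income == -5] <- NA              # not interviewed -> NA

cleaned
```

In this sketch a refusal or "don't know" with no estimated range keeps its negative code, so it still has to be handled in a later step.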
As we are removing the top 2 percent of income observations to reduce the skewness, it is also important to remove the observations with 0 income, as they do not give us any valid information and would bias the data set towards low income values.
Using the survey question on how much the spouse earns, we can infer whether the respondent is married. This lets us check whether marital status affects the gender income gap.
Because the top 2 percent of earners are top-coded to the average value of the top 2 percent of cases, the data set is skewed and misrepresents the true distribution. Our observations would be influenced heavily by the top-coded values, which would reduce the accuracy of the model. For these reasons, the top-coded observations are removed from the data set.
Also, to avoid a biased data set, all observations with 0 income are changed to Not Available (NA). Similarly, for the respondent’s family income, all top-coded observations are changed to Not Available (NA).
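The two filters described above can be sketched on a toy vector as follows. The value 180331 is the top code quoted in the text; everything else is illustrative.

```r
# Toy income vector with zero, ordinary, top-coded and missing values.
income   <- c(0, 15000, 30000, 54000, 180331, 180331, NA)
top_code <- 180331   # top-coded income value quoted in the text

income_clean <- income
# Treat top-coded observations as missing.
income_clean[!is.na(income_clean) & income_clean >= top_code] <- NA
# Treat zero incomes as missing as well.
income_clean[!is.na(income_clean) & income_clean == 0] <- NA

summary(income_clean)
```

Setting the values to NA rather than dropping rows keeps the other columns of those respondents available; functions such as `lm()` and `aov()` then omit incomplete rows by default.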
After cleaning, the data summary for all the variables considered in the model (numeric and factor variables) is:
FACTOR VARIABLES
| gender | physicalEmotionalCondition | race | collegeType | highestDegree | drugUse | industry | maritalStatus | |
|---|---|---|---|---|---|---|---|---|
| female:2772 | No :4736 | Black :1486 | Private for Non Profit: 178 | High school diploma (Regular 12 year program):2398 | Yes : 213 | EDUCATIONAL, HEALTH, AND SOCIAL SERVICES :1186 | Married :3332 | |
| male :2914 | Yes : 315 | Hispanic:1224 | Private for Profit : 211 | Bachelor’s degree (BA, BS) :1228 | No :5215 | PROFESSIONAL AND RELATED SERVICES : 643 | Not Married:2314 | |
| NA | NA’s: 635 | Mixed : 50 | Public : 733 | GED : 594 | NA’s: 258 | ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES: 569 | NA’s : 40 | |
| NA | NA | Other :2926 | Did Not Attend :4278 | Associate/Junior college (AA) : 422 | NA | RETAIL TRADE : 558 | NA | |
| NA | NA | NA | NA’s : 286 | None : 402 | NA | MANUFACTURING : 383 | NA | |
| NA | NA | NA | NA | Master’s degree (MA, MS) : 298 | NA | (Other) :1993 | NA | |
| NA | NA | NA | NA | (Other) : 344 | NA | NA’s : 354 | NA |
NUMERIC VARIABLES
| income | totalincarceration | familyIncome | |
|---|---|---|---|
| Min. : -2 | Min. :0.0000 | Min. : 0 | |
| 1st Qu.: 17500 | 1st Qu.:0.0000 | 1st Qu.: 30000 | |
| Median : 30000 | Median :0.0000 | Median : 52000 | |
| Mean : 34106 | Mean :0.1338 | Mean : 59394 | |
| 3rd Qu.: 47000 | 3rd Qu.:0.0000 | 3rd Qu.: 81920 | |
| Max. :111131 | Max. :9.0000 | Max. :220250 | |
| NA | NA | NA’s :946 |
After cleaning the data we again look at the distribution of the data set. This time we observe that the number of outliers has reduced and the data is closer to a normal distribution than the dirty data. This will help us build a better predictive model.
After cleaning the data and again comparing the income between men and women:
OBSERVATION : The above plot still suggests that men earn more than women.
The data set is still slightly skewed at the edges and still has some outliers, but we cannot remove these observations as they contribute significantly to the data set. Removing the outliers would not give the correct results for the regression model.
The remaining skewness is due to income not being symmetric: the data is positively skewed.
Getting some general statistics for income vs gender to dig deeper into the data set.
| gender | mean | sd | se |
|---|---|---|---|
| female | 30413.03 | 20847.49 | 395.9654 |
| male | 37619.46 | 23858.33 | 441.9725 |
From the above table, we make the same inference that men (mean = 37619.46) earn more than women (mean = 30413.03) on average. This suggests that there is an income gap between men and women. To check the significance of this difference we run a t-test and a Wilcoxon rank-sum test.
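A table like the one above can be produced with a few lines of base R. This is a sketch on toy data, assuming the standard error is computed as sd / sqrt(n), which is consistent with the values reported in the table.

```r
# Toy data: three incomes per gender.
toy <- data.frame(
  gender = factor(c("female", "female", "female", "male", "male", "male")),
  income = c(20000, 30000, 40000, 30000, 40000, 50000)
)

# Split income by gender and compute mean, sd and standard error per group.
stats <- do.call(rbind, lapply(split(toy$income, toy$gender), function(x) {
  x <- x[!is.na(x)]                      # ignore missing incomes
  data.frame(mean = mean(x), sd = sd(x), se = sd(x) / sqrt(length(x)))
}))
stats$gender <- rownames(stats)

stats
```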
To test the significance of our results from the above table and bar-graphs we perform a T-Test
Welch Two Sample t-test
data: income by gender
t = -12.144, df = 5643.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8369.73 -6043.13
sample estimates:
mean in group female mean in group male
30413.03 37619.46
OBSERVATION : The t-test suggests that the results given above are significant.
The t-test results suggest that incomes are on average 7206.4 higher for men than for women (t-statistic -12.14, p < 2.2e-16, 95% CI [-8369.7, -6043.1]). The sign is negative because the difference is computed as female minus male. Given the very small p-value, we can conclude that the effect of gender on income is significant. This supports our hypothesis that men make more money than women.
As the data is not completely normal at the edges, I also performed a Wilcoxon rank-sum test to check the significance of the result (the Wilcoxon rank-sum test does not assume that the data is normally distributed).
Wilcoxon rank sum test with continuity correction
data: income by gender
W = 3328400, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-8000 -5001
sample estimates:
difference in location
-7000
OBSERVATION : Wilcoxon Test also supports the results produced above.
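Both tests above can be run with the formula interface of `t.test()` and `wilcox.test()`. The sketch below uses simulated data with hypothetical group means, so its numbers are unrelated to the survey results.

```r
set.seed(1)
# Simulated incomes: the group means and spread are made up for illustration.
toy <- data.frame(
  gender = factor(rep(c("female", "male"), each = 50)),
  income = c(rnorm(50, mean = 30000, sd = 5000),
             rnorm(50, mean = 37000, sd = 5000))
)

# t.test() runs Welch's unequal-variance test by default (var.equal = FALSE).
tt <- t.test(income ~ gender, data = toy)

# conf.int = TRUE adds the location-shift estimate and its interval.
wt <- wilcox.test(income ~ gender, data = toy, conf.int = TRUE)

tt
wt
```

With the formula interface, the difference is taken as first factor level minus second. Since "female" sorts before "male", a higher male income shows up as a negative t statistic and a negative confidence interval.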
But we should also take into account other factors that might affect income and that are not considered when discussing the gender income gap on its own.
Taking into consideration other variables to check their impact on income and income gap :
Data summary for total number of incarcerations against Income Gap
| totalincarceration | income.gap |
|---|---|
| 0 | 8637.189 |
| 1 | 9199.310 |
| 2 | 9281.953 |
| 3 | 2555.756 |
| 4 | -18862.682 |
| 5 | NaN |
| 6 | NaN |
| 7 | 9930.000 |
| 8 | NaN |
| 9 | NaN |
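An income gap table of this kind can be computed as the difference between mean male and mean female income within each level. The sketch below assumes that definition of `income.gap` and uses a hypothetical grouping column named `group`.

```r
# Toy data: `group` stands in for any grouping variable
# (for example, number of incarcerations).
toy <- data.frame(
  gender = c("male", "female", "male", "female", "male"),
  group  = c("a", "a", "b", "b", "c"),
  income = c(40000, 30000, 35000, 33000, 20000)
)

# Mean income for every group x gender cell.
means <- tapply(toy$income, list(toy$group, toy$gender), mean, na.rm = TRUE)

# Gap = mean male income minus mean female income, per group.
income.gap <- means[, "male"] - means[, "female"]

income.gap
```

Group "c" has no female respondents, so no comparison is possible and its gap is missing. `tapply()` reports such empty cells as NA; the NaN entries in the table above reflect the same situation.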
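We start with the effect of total incarcerations on income. Income is correlated with the total number of incarcerations with a correlation coefficient of -0.12. This indicates a weak negative linear association: respondents with more incarcerations tend to have lower incomes. A correlation coefficient is unitless, so it does not measure the percentage change in income per incarceration.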
Now, to check the same relation with income, we use a box plot to observe income and total incarcerations for men and women.
From the above graph, we observe that income reduces with a greater number of incarcerations, but the income gap is not significantly different across the number of incarcerations. Also, the comparison is not available for a few data points and the data has many outliers, which makes the results for this variable less reliable.
From the left graph we can observe the effect of the total number of incarcerations on income for both genders. As the number of incarcerations increases, income reduces. Here we assume the same effect of total incarcerations for both genders.
From the right graph we can observe the effect of the total number of incarcerations on income separately for each gender. For both genders the association with income is negative and almost the same (slightly higher for females than males).
We can also confirm this using the correlation between income and total incarcerations from the graph above. The correlation value for men is -0.184 and for women is -0.091. This tells us that the association between the number of incarcerations and income depends only loosely on gender. For both males and females, there is a negative association between the number of incarcerations and income (a greater number of incarcerations goes with less income).
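Correlations overall and per gender can be computed with `cor()`. The sketch below uses toy data in which `x` stands in for a numeric predictor such as the number of incarcerations.

```r
# Toy data: `x` is a hypothetical numeric predictor.
toy <- data.frame(
  gender = c("male", "male", "male", "female", "female", "female"),
  x      = c(0, 1, 2, 0, 1, 2),
  income = c(40000, 35000, 30000, 30000, 29000, 31000)
)

# Pearson correlation over all rows, ignoring incomplete pairs.
overall <- cor(toy$x, toy$income, use = "complete.obs")

# The same correlation computed separately within each gender.
by_gender <- sapply(split(toy, toy$gender), function(d) {
  cor(d$x, d$income, use = "complete.obs")
})

overall
by_gender
```

The coefficient is unitless and always lies between -1 and 1. It describes the strength of a linear association, not the size of the change in income per unit of `x`; a slope from `lm()` would be needed for that.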
Finally to check the significance of the results performed above we use an anova test
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 73776198960 73776198960 149.919 <2e-16
totalincarceration 1 66129372004 66129372004 134.380 <2e-16
gender:totalincarceration 1 179606366 179606366 0.365 0.546
Residuals 5682 2796154489170 492107443
gender ***
totalincarceration ***
gender:totalincarceration
Residuals
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result: The p-value for the gender:totalincarceration interaction (0.546) is above the 0.05 level, so the interaction is not significant. The data suggests that there is an association between income and total incarcerations, but not between the income gap and total incarcerations. Contrary to our hypothesis, the total number of incarcerations does not affect the income gap between males and females.
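The test used here, and repeated for each variable below, is a two-way ANOVA in which the interaction row answers whether the variable changes the income gap. A sketch on simulated data, with made-up effect sizes and a hypothetical predictor `x`:

```r
set.seed(42)
n <- 200
toy <- data.frame(
  gender = factor(rep(c("female", "male"), each = n / 2)),
  x      = rep(0:3, times = n / 4)       # hypothetical numeric predictor
)
# Both genders get the same slope on x, so there is no true interaction.
toy$income <- 30000 + 7000 * (toy$gender == "male") - 2000 * toy$x +
  rnorm(n, sd = 3000)

# gender * x expands to gender + x + gender:x
fit <- aov(income ~ gender * x, data = toy)
tab <- summary(fit)[[1]]

# Rows are gender, x, gender:x, Residuals; the third row is the interaction.
p_interaction <- tab[["Pr(>F)"]][3]

tab
p_interaction
```

The main-effect rows say whether gender and `x` are each associated with income. Only the `gender:x` row tests whether the gap between genders differs across values of `x`.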
Reject total incarcerations factor from income gap model
Data summary for physical and Emotional Condition against Income Gap
| physicalEmotionalCondition | income.gap |
|---|---|
| No | 7303.051 |
| Yes | 5106.794 |
To check whether physical and emotional condition affects income for men and women, we plot bar graphs of the average incomes for both genders:
The first graph suggests that a physical or emotional condition that limits school or work does have an effect on the respondent’s income. But in the second graph, the impact on the income gap between men and women is small.
The variance of the income gap for the “yes” response is very high, while the difference in income gap between “yes” and “no” is small, which makes the variable insignificant for the income gap model.
We can support this analysis with an anova test of gender and physical-emotional condition on income.
Df Sum Sq Mean Sq F value
gender 1 62497570350 62497570350 126.139
physicalEmotionalCondition 1 23570462845 23570462845 47.572
gender:physicalEmotionalCondition 1 349334444 349334444 0.705
Residuals 5047 2500621017323 495466815
Pr(>F)
gender < 2e-16 ***
physicalEmotionalCondition 5.95e-12 ***
gender:physicalEmotionalCondition 0.401
Residuals
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
635 observations deleted due to missingness
Result: The p-value for the gender:physicalEmotionalCondition interaction (0.401) is above the 0.05 level and is not significant. The data suggests that there is an association between income and physical-emotional condition, but not between the income gap and physical-emotional condition. Contrary to our hypothesis, physical and emotional condition has the same effect on male and female incomes.
Reject physical-emotional condition factor from income gap model
Data summary for race against Income Gap
| race | income.gap |
|---|---|
| Black | 3401.801 |
| Hispanic | 8718.042 |
| Mixed | 6227.741 |
| Other | 7568.899 |
To check whether race affects income for men and women, we plot bar graphs of the average incomes for both genders:
From the above graphs we notice that race has an impact on the income gap between males and females. The gap is largest for Hispanic respondents and smallest for Black respondents. The income gap for the Mixed group can be left out of our conclusion, as its variance is very high. To confirm our analysis, we perform an anova test of gender and race on income.
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 73776198960 73776198960 152.277 < 2e-16 ***
race 3 105733716539 35244572180 72.746 < 2e-16 ***
gender:race 3 5810933956 1936977985 3.998 0.00744 **
Residuals 5678 2750918817046 484487287
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result: The p-value for the gender:race interaction (0.007) is below the 0.05 level and is significant. The data suggests that there is an association between the income gap and race. The results support our hypothesis that race has an effect on the income gap between men and women.
Accept race for income gap model
Data summary for Number of Children at home against Income Gap
| biologicalChild | income.gap |
|---|---|
| 0 | 4584.192 |
| 1 | 11805.627 |
| 2 | 15905.450 |
| 3 | 25452.801 |
| 4 | 25101.957 |
| 5 | 12187.550 |
| 6 | NaN |
We start with the effect of the number of children on income. Income is correlated with the number of children at home with a correlation coefficient of -0.05. This indicates a very weak negative linear association; a correlation coefficient does not measure the percentage change in income per child.
Now to check the same relation with income gap we use a box plot to observe income and number of children for men and women.
From the above graph, we notice a big gap in median income between males and females. Men earn much more than women when they have children at home, possibly because women take on more of the childcare and so cannot take up full-time jobs. The difference in income is clearly visible between the two genders.
From the left graph we can observe the effect of the number of children on income for both genders. Here we assume that the number of children has the same effect for both genders: as the number of children increases, income reduces.
From the right graph we can observe that the number of children has opposite effects on the two genders. For women the association with income is negative, possibly because women need to take care of the children at home, while for men the association is positive, possibly because they need to earn more to support the family.
I found the opposite correlation between number of children and income for men and women interesting.
The correlation value for men is 0.231 and for women is -0.189. This tells us that the association between the number of children and income seems to depend on gender. Among females, the association is negative (with more children in the house, females earn less), while among males it is positive (males earn more if there are more children at home).
We can confirm our results with an anova test of number of biological children and gender on income.
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 94327983656 94327983656 223.062 <2e-16 ***
biologicalChild 1 1088353752 1088353752 2.574 0.109
gender:biologicalChild 1 55192158147 55192158147 130.516 <2e-16 ***
Residuals 2766 1169680891635 422878124
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result: The p-value for the gender:biologicalChild interaction (< 2e-16) is significant at the 0.05 level, so the data suggests that there is an association between income, gender and number of children.
Accept number of children factor for income gap model
Data summary for College Type against Income Gap
| collegeType | income.gap |
|---|---|
| Private for Non Profit | 10825.860 |
| Private for Profit | 6863.177 |
| Public | 8076.693 |
| Did Not Attend | 7033.632 |
To check whether college type affects income for men and women, we plot bar graphs of the average incomes for both genders:
From the above graphs, we observe that the income gap is almost the same for all types of colleges. The gap is slightly larger for the “Private for Non Profit” college type, but the variance for that group is also very high (see the error bars in the income gap graph), so the difference is not reliable. For these reasons we can drop the college type variable from our income gap model.
We can support this analysis with an anova test of gender and college type on income.
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 73422079347 73422079347 146.786 < 2e-16 ***
collegeType 3 20840428700 6946809567 13.888 5.1e-09 ***
gender:collegeType 3 723213834 241071278 0.482 0.695
Residuals 5392 2697073270722 500199049
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
286 observations deleted due to missingness
Result: The p-value for the gender:collegeType interaction (0.695) is above the 0.05 level and is not significant. Contrary to our hypothesis, college type has the same effect on male and female incomes.
Reject college type from income gap model
Data summary for Drug Use against Income Gap
| drugUse | income.gap |
|---|---|
| Yes | 5110.492 |
| No | 7578.205 |
To check whether drug use affects income for men and women, we plot bar graphs of the average incomes for both genders:
From the above graph, we can observe that the variance for the “yes” response is very high, which makes the difference in income gap between the “yes” and “no” groups unreliable.
We can support this analysis with an anova test of gender and drug use on income.
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 74256771082 74256771082 147.780 < 2e-16 ***
drugUse 1 5849146011 5849146011 11.641 0.00065 ***
gender:drugUse 1 295670806 295670806 0.588 0.44306
Residuals 5424 2725461157412 502481777
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
258 observations deleted due to missingness
Result: The p-value for the gender:drugUse interaction (0.443) is above the 0.05 level and is not significant. Contrary to our hypothesis, drug use has the same effect on male and female incomes.
Reject drug use from income gap model
Data summary for Industry against Income Gap
| industry | income.gap |
|---|---|
| ACS SPECIAL CODES | -10116.667 |
| ACTIVE DUTY MILITARY | 13399.900 |
| AGRICULTURE, FORESTRY AND FISHERIES | 21194.571 |
| CONSTRUCTION | 8720.578 |
| EDUCATIONAL, HEALTH, AND SOCIAL SERVICES | 6867.232 |
| ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES | 4307.533 |
| FINANCE, INSURANCE, AND REAL ESTATE | 7054.947 |
| INFORMATION AND COMMUNICATION | 4440.687 |
| MANUFACTURING | 4758.557 |
| MINING | 3780.067 |
| OTHER SERVICES | 5496.095 |
| PROFESSIONAL AND RELATED SERVICES | 2194.134 |
| PUBLIC ADMINISTRATION | 14443.163 |
| RETAIL TRADE | 4449.122 |
| TRANSPORTATION AND WAREHOUSING | 13037.409 |
| UTILITIES | 35126.383 |
| Not Working | 3280.049 |
| WHOLESALE TRADE | 7655.057 |
To check whether industry affects income for men and women, we plot bar graphs of the average incomes for both genders:
From the bar charts above, we can see that industry affects both men and women. For example, in ACS Special Codes females earn more than males, while in all the other industries males tend to earn a higher income. For some industries, for example Information and Communication as well as Construction, the variance is very high, which makes the income gap estimate for those industries unreliable.
As hypothesised, industry does make a difference, because some industries are male dominated while others are female dominated. I also had to remove some industries from my analysis, such as Military, where the sample was unevenly distributed with 20 men and 1 woman. Such a sample does not provide us with any useful information. With multiple industries behaving differently, it is difficult to decide whether to keep the variable or not. To understand the relationship between industry and the income gap, we use an anova test of income on gender and industry.
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 67992597522 67992597522 152.671 <2e-16 ***
industry 17 313329119187 18431124658 41.385 <2e-16 ***
gender:industry 17 17323618723 1019036395 2.288 0.0019 **
Residuals 5296 2358599928647 445354971
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
354 observations deleted due to missingness
Result: The p-value for the gender:industry interaction (0.002) is below the 0.05 level and is significant. Supporting our hypothesis, industry has an impact on the income gap between men and women.
Accept industry for income gap model
We start by checking the relation between family income and income using scatter plots.
The graph is densely populated and there is a positive association between income and family income. Since the respondent’s income is part of the total family income, we need to check the correlation between income and family income. If they are highly correlated, we need to remove the variable from our model, as it would give misleading results for its impact on income.
The above matrix suggests that income is moderately correlated with family income, with a correlation coefficient of 0.41. We can infer from this that, even though family income might be significant, its effect might be exaggerated because of this correlation.
From the left graph we can observe the effect of family income on income for both genders. Here we assume that family income has the same effect for both genders (an increase in family income goes with an increase in income).
For the right graph we allow a different effect for males and females, though the graph shows almost the same effect of family income on both men and women.
By looking at the graphs above we observe that family income does have an impact on income and the income gap. The correlation value for men is 0.429 and for women is 0.422. For both genders, there is a positive association between income and family income (more family income goes with more income). But we must also keep in mind the correlation between income and family income from the above matrix.
This suggests that even though family income is significant for our model, because of this correlation it does not contribute to our model as much as the results depict.
We can support this analysis with an anova test of gender and family income on income.
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 78671277907 78671277907 194.5 < 2e-16 ***
familyIncome 1 419408434310 419408434310 1037.0 < 2e-16 ***
gender:familyIncome 1 5015972393 5015972393 12.4 0.000433 ***
Residuals 4736 1915468049549 404448490
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Family income is not causal. This is the old adage that correlation does not imply causation. In this example, we have strong evidence that higher family income is positively associated with the respondent’s income. This doesn’t mean that decreasing family income will lower the respondent’s income. The relationship is not causal, at least not in that direction. A more reasonable explanation is that higher income leads to higher family income.
Result: The p-value for the gender:familyIncome interaction (0.000433) is below the 0.05 level and is significant. There is some correlation between income and family income, but the result supports our hypothesis that family income has an impact on the income gap between men and women.
Accept family income for income gap model
Data summary for Marital Status against Income Gap
| maritalStatus | income.gap |
|---|---|
| Married | 10696.111 |
| Not Married | 2582.938 |
To check whether marital status affects income for men and women, we plot bar graphs of the average incomes for both genders:
By looking at the graphs above we observe that marital status has a very different impact on men and women. The correlation value for men is 0.231 and for women is -0.189. This tells us that the association between marital status and income seems to depend on gender. Among females, the association is negative (married women earn less than unmarried women), while among males it is positive (married men earn more than unmarried men).
To check the significance of these results we use an anova test.
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 73808131692 73808131692 151.20 < 2e-16 ***
maritalStatus 1 59363983695 59363983695 121.61 < 2e-16 ***
gender:maritalStatus 1 22436194453 22436194453 45.96 1.33e-11 ***
Residuals 5642 2754127161425 488147317
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
40 observations deleted due to missingness
Result: The p-value for the gender:maritalStatus interaction (1.33e-11) is below the 0.05 level and is significant. Supporting our hypothesis, marital status has an impact on the income gap between men and women.
Accept marital status for income gap model
After checking all the variables individually against the income gap, we have narrowed the list down to 5 variables using the anova test: race, number of biological children, industry, family income and marital status.
All these variables had a significant effect on the income gap between males and females when we considered each variable separately.
We now build a regression model on these variables along with gender to understand the income gap between men and women:
lm(income ~ gender + race + biologicalChild + familyIncome + industry + maritalStatus, data = nyse_top2removed)
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9543.684 | 10547.731 | 0.905 | 0.3657 |
| gendermale | 12128.720 | 900.815 | 13.464 | 0.0000 |
| raceHispanic | 1860.473 | 1057.157 | 1.760 | 0.0786 |
| raceMixed | 5348.945 | 4195.600 | 1.275 | 0.2025 |
| raceOther | 3389.917 | 951.958 | 3.561 | 0.0004 |
| biologicalChild | 606.428 | 380.801 | 1.593 | 0.1114 |
| familyIncome | 0.204 | 0.011 | 18.542 | 0.0000 |
| industryACTIVE DUTY MILITARY | 18168.632 | 11872.683 | 1.530 | 0.1261 |
| industryAGRICULTURE, FORESTRY AND FISHERIES | 785.957 | 11248.101 | 0.070 | 0.9443 |
| industryCONSTRUCTION | 4264.167 | 10520.911 | 0.405 | 0.6853 |
| industryEDUCATIONAL, HEALTH, AND SOCIAL SERVICES | 4466.955 | 10433.196 | 0.428 | 0.6686 |
| industryENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES | -4292.787 | 10471.265 | -0.410 | 0.6819 |
| industryFINANCE, INSURANCE, AND REAL ESTATE | 9820.070 | 10528.107 | 0.933 | 0.3511 |
| industryINFORMATION AND COMMUNICATION | 3260.241 | 10906.055 | 0.299 | 0.7650 |
| industryMANUFACTURING | 4918.710 | 10503.670 | 0.468 | 0.6396 |
| industryMINING | 22045.524 | 11637.037 | 1.894 | 0.0583 |
| industryOTHER SERVICES | -1664.867 | 10530.469 | -0.158 | 0.8744 |
| industryPROFESSIONAL AND RELATED SERVICES | 811.216 | 10476.694 | 0.077 | 0.9383 |
| industryPUBLIC ADMINISTRATION | 13775.708 | 10576.696 | 1.302 | 0.1929 |
| industryRETAIL TRADE | 169.686 | 10464.912 | 0.016 | 0.9871 |
| industryTRANSPORTATION AND WAREHOUSING | 5366.190 | 10587.090 | 0.507 | 0.6123 |
| industryUTILITIES | 13704.909 | 11388.756 | 1.203 | 0.2290 |
| industryNot Working | -7337.938 | 10514.318 | -0.698 | 0.4853 |
| industryWHOLESALE TRADE | 1569.235 | 10651.546 | 0.147 | 0.8829 |
| maritalStatusNot Married | -2558.104 | 906.506 | -2.822 | 0.0048 |
Some of the inferences based on this model:
Looking at the p-values, gendermale is a statistically significant predictor of income, with a p-value that rounds to 0.0000 in the table above (p < 0.001).
Males earn about 12128.72 more than females on average, holding the other variables in the model constant.
Family income also appears to be a statistically significant predictor of income, but we know that income and family income have a correlation of 0.41. This weakens the case for family income, as the association might not reflect a causal effect.
Marital status, as predicted in the hypothesis, is also significant, with a p-value of 0.0048 for maritalStatusNot Married.
Now that we have considered the effect of each variable separately, we can take into account the interaction between each of the final variables and gender. To understand the joint effect of gender and another variable on income, we add the interaction term and compare the linear regression models with and without it. The p-value from the ANOVA test tells us whether the interaction term is significant and whether we should include it in our final regression model.
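The nested-model comparison described above can be sketched as follows. This is a minimal illustration on simulated data, not the NLSY97 data: the data frame `sim`, its columns, and the effect sizes are all assumptions made up for the example. Only the use of `lm()` and `anova()` mirrors the approach taken in this report.

```r
# Minimal sketch on simulated data (NOT the NLSY97 data); all names and
# effect sizes below are hypothetical.
set.seed(1)
n <- 500
sim <- data.frame(
  gender        = factor(sample(c("female", "male"), n, replace = TRUE)),
  maritalStatus = factor(sample(c("Married", "Not Married"), n, replace = TRUE))
)

# Simulate an income whose gender gap is smaller for unmarried respondents.
is_male   <- sim$gender == "male"
unmarried <- sim$maritalStatus == "Not Married"
sim$income <- 30000 + 15000 * is_male + 2500 * unmarried -
  12000 * (is_male & unmarried) + rnorm(n, sd = 10000)

# Fit the model without and with the gender:maritalStatus interaction.
fit_main <- lm(income ~ gender + maritalStatus, data = sim)
fit_int  <- lm(income ~ gender + maritalStatus + gender:maritalStatus, data = sim)

# F-test comparing the nested models; a small Pr(>F) favours keeping the interaction.
cmp <- anova(fit_main, fit_int)
print(cmp)
```

The second row of `cmp` holds the F statistic and its p-value, which is the quantity read off each ANOVA table in this section.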
Analysis of Variance Table
Model 1: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus + gender:race
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2253 727243144767
2 2250 710098504020 3 17144640747 18.108 1.306e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the p-value from comparing the models with and without the interaction term is significant, we conclude that the income gap between men and women varies by race.
Inference : Race and Gender interaction term is significant with P-Value approx 0.
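As a check on how the F statistic in such a table arises, the sketch below recomputes it from the residual sums of squares and degrees of freedom printed for the gender:race comparison. The variable names are illustrative; the numbers are copied from the ANOVA output above.

```r
# Values copied from the ANOVA table for the gender:race comparison.
rss_without <- 727243144767   # RSS of Model 1 (no interaction)
rss_with    <- 710098504020   # RSS of Model 2 (with gender:race)
df_without  <- 2253           # residual df of Model 1
df_with     <- 2250           # residual df of Model 2

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
df_added <- df_without - df_with
f_stat   <- ((rss_without - rss_with) / df_added) / (rss_with / df_with)

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_stat, df_added, df_with, lower.tail = FALSE)

print(round(f_stat, 3))
print(signif(p_value, 4))
```

The same calculation applies to the other four comparisons by substituting their RSS and Df values.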
Analysis of Variance Table
Model 1: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus + gender:biologicalChild
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2253 727243144767
2 2252 711046366673 1 16196778093 51.298 1.07e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the p-value from comparing the models with and without the interaction term is significant, we conclude that the income gap between men and women varies with the number of children at home.
Inference : Number of Children and Gender interaction term is significant with P-Value approx 0.
Analysis of Variance Table
Model 1: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus + gender:familyIncome
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2253 727243144767
2 2252 713875998042 1 13367146724 42.168 1.027e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the p-value from comparing the models with and without the interaction term is significant, we conclude that the income gap between men and women varies with family income.
Inference : Family Income and Gender interaction term is significant with P-Value approx 0.
Analysis of Variance Table
Model 1: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus + gender:industry
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2253 727243144767
2 2238 720573077386 15 6670067381 1.3811 0.1473
Because the p-value from comparing the models with and without the interaction term is not significant, we find no evidence that the income gap between men and women varies by industry.
Inference : Industry and Gender interaction term is not significant with P-Value approx 0.147.
Analysis of Variance Table
Model 1: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus
Model 2: income ~ gender + race + biologicalChild + familyIncome + industry +
maritalStatus + gender:maritalStatus
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2253 727243144767
2 2252 710131763595 1 17111381172 54.264 2.446e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the p-value from comparing the models with and without the interaction term is significant, we conclude that the income gap between men and women varies by marital status.
Inference : Marital and Gender interaction term is significant with P-Value approx 0.
Of the five interaction models we tested, only the Industry-Gender interaction term was not significant. This matches the graphs above, where many industries were not significantly related to the income gap in the given data set. So we will include the interaction terms for four variables (Race, Marital Status, Number of Children and Family Income) but not for Industry.
From all the given observations, I would like to concentrate on the effect of the respondent's Marital Status on the income gap. According to prior studies, married men earn more than unmarried men, while the opposite holds for women: unmarried women earn more than married women.
lm(income ~ race + industry + familyIncome + biologicalChild + gender * maritalStatus, data = nyse_top2removed)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9833.657 | 10425.291 | 0.943 | 0.3457 |
| gendermale | 15572.081 | 1005.597 | 15.485 | 0.0000 |
| maritalStatusNot Married | 2502.075 | 1128.999 | 2.216 | 0.0268 |
| raceHispanic | 2254.860 | 1046.248 | 2.155 | 0.0313 |
| raceMixed | 5962.603 | 4147.703 | 1.438 | 0.1507 |
| raceOther | 3626.406 | 941.448 | 3.852 | 0.0001 |
| industryACTIVE DUTY MILITARY | 14662.186 | 11744.429 | 1.248 | 0.2120 |
| industryAGRICULTURE, FORESTRY AND FISHERIES | -1185.624 | 11120.673 | -0.107 | 0.9151 |
| industryCONSTRUCTION | 2400.648 | 10401.785 | 0.231 | 0.8175 |
| industryEDUCATIONAL, HEALTH, AND SOCIAL SERVICES | 3043.537 | 10313.822 | 0.295 | 0.7680 |
| industryENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES | -5643.421 | 10351.263 | -0.545 | 0.5857 |
| industryFINANCE, INSURANCE, AND REAL ESTATE | 8190.332 | 10408.173 | 0.787 | 0.4314 |
| industryINFORMATION AND COMMUNICATION | 1560.670 | 10781.848 | 0.145 | 0.8849 |
| industryMANUFACTURING | 3214.942 | 10384.244 | 0.310 | 0.7569 |
| industryMINING | 19211.606 | 11508.302 | 1.669 | 0.0952 |
| industryOTHER SERVICES | -3365.200 | 10410.715 | -0.323 | 0.7465 |
| industryPROFESSIONAL AND RELATED SERVICES | -491.753 | 10356.516 | -0.047 | 0.9621 |
| industryPUBLIC ADMINISTRATION | 11928.053 | 10456.854 | 1.141 | 0.2541 |
| industryRETAIL TRADE | -1449.321 | 10345.695 | -0.140 | 0.8886 |
| industryTRANSPORTATION AND WAREHOUSING | 3571.168 | 10466.955 | 0.341 | 0.7330 |
| industryUTILITIES | 11552.863 | 11260.263 | 1.026 | 0.3050 |
| industryNot Working | -8780.117 | 10394.036 | -0.845 | 0.3984 |
| industryWHOLESALE TRADE | -691.998 | 10532.300 | -0.066 | 0.9476 |
| familyIncome | 0.208 | 0.011 | 19.029 | 0.0000 |
| biologicalChild | 162.269 | 381.177 | 0.426 | 0.6704 |
| gendermale:maritalStatusNot Married | -12482.300 | 1694.484 | -7.366 | 0.0000 |
Looking at the p-values, gendermale is a statistically significant predictor of income, with a p-value below 0.0001.
SOME OF THE INTERESTING OBSERVATIONS IN THE FINAL REGRESSION MODEL
Looking at the analysis, we can confirm with some confidence that men earn more than women. Among married respondents (the baseline group), men earn about $15,572 more than women on average.
The income gap between unmarried men and unmarried women is about $12,482 smaller than the gap between married men and married women. Marital Status is a factor that is highly associated with the income gap between men and women.
Contrary to the hypothesis, the number of children at home is not a significant predictor of income in this model, as its p-value (0.6704) is greater than 0.05. This is the main effect only; the earlier model comparison found the interaction between gender and number of children to be significant.
The estimated income gap between unmarried men and unmarried women is equal to the sum of the coefficient of “gendermale” and the coefficient of “gendermale:maritalStatusNot Married”
= 15572 + (-12482) = 3090
Since married is the baseline for the regression model, the estimated income gap between men and women for married respondents = “gendermale” + 0
= 15572
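The arithmetic above can be reproduced directly from the two coefficients. The sketch below copies the estimates from the final model table; the variable names are illustrative and do not come from the original analysis code.

```r
# Coefficients copied from the final model table above.
b_male            <- 15572.081    # gendermale
b_male_notmarried <- -12482.300   # gendermale:maritalStatusNot Married

# Married is the baseline level, so the interaction term drops out.
gap_married <- b_male

# For unmarried respondents the interaction term is added to the gender effect.
gap_unmarried <- b_male + b_male_notmarried

print(round(c(married = gap_married, unmarried = gap_unmarried)))
```

With a fitted model object, the same quantities could be read from `coef()` instead of being typed in by hand.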
These four plots are important diagnostic tools in assessing whether the linear model is appropriate. The first two plots are the most important.
Residuals vs. Fitted
We see a clear “funneling” phenomenon. The residuals are well concentrated around 0 for small fitted values, but they become more and more spread out as the fitted values increase. This is an instance of “increasing variance”. The standard linear regression assumption is that the variance is constant across the entire range. Because this assumption does not hold, the plot indicates that the given linear model is not fully appropriate.
Normal QQ plot
Residuals deviate from the diagonal line in both the upper and lower tails. The tails are “heavier” (have larger values) than we would expect under the standard modeling assumptions, which is indicated by the points forming a “steeper” line than the diagonal. For the p-values to be believable, the residuals from the regression must look approximately normally distributed. Because they do not, the p-values should be interpreted with caution.
Scale-location plot
This is another version of the residuals vs. fitted plot. Consistent with the funneling seen in the first plot, the red line, which should be horizontal, is tilted, suggesting that the constant-variance assumption does not hold.
Residuals vs Leverage
The data is positively skewed from the beginning. Even after removing the top-coded values from the data set, income is still positively skewed, which is why we see outliers. Points with both a high residual (poorly described by the model) and high leverage (high influence on the model fit) are influential outliers: they pull the model fit away from the rest of the data.
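The four diagnostic plots discussed above are the default output of `plot()` on a fitted `lm` object. The sketch below reproduces the funnel pattern on simulated data; the variables `x` and `y` and the noise model are assumptions made up for the example, not the NLSY97 data.

```r
# Hypothetical data whose noise grows with the predictor (NOT the NLSY97 data).
set.seed(2)
x <- runif(300, 1, 10)
y <- 5000 * x + rnorm(300, sd = 1500 * x)   # larger x gives larger spread
fit <- lm(y ~ x)

# The four default diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage, arranged in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)
```

In the first panel the residuals fan out as the fitted values increase, which is the same pattern described for the income model.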
Of the many variables hypothesised, Race, Marital Status and Number of Children showed a significant interaction with gender, while Industry did not. Family income, even though it tested as significant, cannot be treated as a driver, because income and family income are correlated. The relation between income and family income is more of a consequence than a causal effect.
From the given study, we can say with some confidence that there is a significant difference in income between men and women. This difference also depends on variables such as Race, Marital Status and Number of Children. Marital Status especially interests me, as the difference in its effect on men and women was large.
I believe this gender pay gap is also prevalent due to factors such as culture, social structure and history, which cannot be quantified. Some of these are reflected in factors such as race and marital status, which were studied in this analysis.
The data provided by NLSY was dirty and incomplete. It contained many top-coded and missing values which, if simply dropped, would have reduced our data set considerably. To use the data set for making inferences, we made many assumptions and imputed multiple values. The changes made to the data set and the assumptions made before the analysis will influence our final results.
The assumptions made during our analysis:
Independence - The data was collected over a long period of time. Linear regression assumes that the observations are independent; if they are not, our analysis might be inaccurate.
Normality - The residuals are not normally distributed, especially in the tails, as we can see from the Q-Q plot. The linear model may therefore not be a good reflection of respondents with very low or very high incomes.
Imputations - Because of the many incomplete values and the coded values from -1 to -5, many assumptions were made in order to impute values. These imputations will have an impact on the final results.
Top Coded Values - The top-coded values were removed from the data set because, as observed from the Q-Q plot, the top-coded income values were skewing the data set. This removal will influence our results.
These assumptions have influenced the data and may have biased the linear regression model. The potential limitations of my analysis are summarized below.
As the data was not clean and not all variables were taken into consideration while building the regression model, I do not have a high level of confidence in my model. The residuals of the linear regression model were not normally distributed, which violates an assumption of linear regression and makes the model less reliable. The conclusions of my analysis can be considered a possible reflection of the real-world scenario, but they cannot be generalised to a very large scale. Policy makers can use the study to understand possible trends, but should not make important decisions on the basis of this study alone.